This article introduces the %TS_HISTORY_CHECK macro, which analyses the influence of the length of the available data history of a time series on the quality of the forecast for future periods.
Additionally, for each time series and each simulation repetition per time series, the length of the time series that delivers the best forecast quality is selected. Based on these results, the distribution of the optimal length of the time history over all time series examples can be examined in more detail.
The rationale for the article and the work on this macro is that analysts usually invest a lot of time in fine-tuning their forecasting models. However, little focus is given to the influence, positive and negative, of the length of the time series. Not all models and model parameters handle the downweighting of "old" data in a sufficient way.
This allows you to gain a better understanding of the influence of the length of your time series on forecast accuracy. There are cases which correspond to the intuition that, on average, a longer time history improves forecast quality. However, there are also cases where the individual optimal length of the time history can be very short (from 1 to 12 months). This is especially true for time series where the underlying patterns change frequently and quickly and, therefore, the recent history is more important than the older history.
The author published an article on medium.com, "Determining the best length of the history of your timeseries data for timeseries forecasting", which shows the business results and business interpretation of this analysis. Find more links in the last section of this article.
Two macros are presented here.
Both macros are applied to two different datasets to illustrate their usage and the interpretation of the findings.
The simulation procedure is as follows:
This procedure is also iterated for different start points of the time series. This means that the above setup is “shifted” in the time series to avoid studying results which only depend on one start point.
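The simulation logic described above can be sketched as follows. This is a minimal, hypothetical Python rendering for illustration only; the macro itself drives SAS forecasting procedures, and `mean_forecast` is just a stand-in model, not something the macro uses:

```python
# Hypothetical sketch of the simulation: for each shift of the start point
# and each candidate history length, fit on the training window and
# measure MAPE on the held-out validation window.

def simulate(series, min_hist, max_hist, shifts, period_valid, forecast_fn):
    """Return a list of (shift, history, mape) tuples.
    Assumes the series is long enough for max_hist plus the largest shift."""
    results = []
    for shift in shifts:
        # shift the end of the usable data back in time
        end = len(series) - shift
        valid = series[end - period_valid:end]          # validation window
        for history in range(min_hist, max_hist + 1):
            train = series[end - period_valid - history:end - period_valid]
            preds = forecast_fn(train, period_valid)
            mape = sum(abs(p - a) / abs(a)
                       for p, a in zip(preds, valid)) / period_valid
            results.append((shift, history, mape))
    return results

def mean_forecast(train, horizon):
    # simplest possible stand-in model: forecast the training mean
    m = sum(train) / len(train)
    return [m] * horizon
```

The real macro replaces `mean_forecast` with PROC HPFENGINE (or PROC ESM) and aggregates the per-series errors with the statistic chosen via the AGGRSTAT parameter.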
An example call of the macro is shown based on the SASHELP.AIR data.
As the macro requires a time series ID variable to identify each time series, this variable has to be provided even if the data contain only one time series.
data air;
set sashelp.air;
tsid = 1;
run;
You can invoke the macro on the SASHELP.AIR data as follows:
%ts_history_check(data=air,tsid=tsid,y=air,
timeid=date,interval=month,
minhist=2,maxhist=48,
shiftfrom=0,shiftto=4,shiftby=1,
periodvalid=12,
mrep=sashelp.hpfdflt,sellist=tsfsselect,
stat=mape,aggrstat=median);
The output shows a line chart. The X-axis represents the respective number of available history months. The Y-axis represents the median MAPE value over all time series and all shift scenarios.
You see that especially in the first 12-18 months, additional history information decreases the forecast error. You also see, however, that the median MAPE slightly increases with a longer time history after about 2 years, indicating that a longer history is not necessarily beneficial for forecasting on this data.
The advantage of the HPFENGINE procedure is that it automatically evaluates a large list of possible time series models (ARIMA, exponential smoothing, ...). If, however, you do not have the HPFENGINE procedure available, you can use the %TS_HISTORY_CHECK_ESM macro, which uses exponential smoothing models.
%ts_history_check_esm(data=air,tsid=tsid,y=air,
timeid=date,interval=month,
minhist=2,maxhist=48,
shiftfrom=0,shiftto=4,shiftby=1,
periodvalid=12,
stat=mape,aggrstat=median);
You receive a similar output as above. However, as only a limited set of forecasting models is used here, the MAPE level is higher than above. You also see that if you only have exponential smoothing models available, a longer time history is still beneficial. Note the steep drop as soon as a full year of data history is available in the data (12, 24, 36, ... months).
As the macro loops over SHIFTS and TIME_HISTORIES, a large number of notes and warnings might be written to the SAS log, which can therefore fill up quickly.
You can use different options to turn the generation of NOTES in the log and the printing of the macro code off and on.
options nonotes nomprint ;
%ts_history_check_ESM(data=timeseries_retail,tsid=tsid,y=sales, ....... );
options notes mprint;
Another option is to route the log content to a file. Here you can use the PRINTTO procedure or specify the ALTLOG system option for your session.
proc printto log = "c:\tmp\logfile.txt";
run;
%TS_HISTORY_CHECK(...);
proc printto log=log;
run;
It can also happen that the macro execution shows an error, e.g. when a certain time series model type is fit with insufficient data. This does not prevent the macro from functioning correctly; however, you will see the error in the log file.
Again you use the %TS_HISTORY_CHECK macro. The dataset TIMESERIES_RETAIL contains weekly retail sales data for 170 weeks. The validation period is set to 13 weeks (one quarter) and the time history is iterated from 2 to 130 weeks.
%ts_history_check(data=timeseries_retail,tsid=tsid,y=sales,
timeid=date,interval=week,
minhist=2,maxhist=130,
shiftfrom=0,shiftto=42,shiftby=7,
periodvalid=13,
mrep=sashelp.hpfdflt,sellist=tsfsselect,
stat=mape,aggrstat=median);
A similar picture can be seen with these data.
The above result is of course not generalizable to other data. However, it shows that it makes sense to study the influence of the length of the data history on the forecast error with your individual data, to get a better view on the importance of a long or short data history.
For completeness, the case where the HPFENGINE procedure is not available is shown as well.
%ts_history_check_ESM(data=timeseries_retail,tsid=tsid,y=sales,
timeid=date,interval=week,
minhist=2,maxhist=130,
shiftfrom=0,shiftto=42,shiftby=7,
periodvalid=13,
stat=mape,aggrstat=median);
The following parameters can be specified with the macro %TS_HISTORY_CHECK:
The following parameters are additionally available with the %ts_history_check_ESM macro:
Note that the SELLIST and the MREP parameter are not available with the %TS_HISTORY_CHECK_ESM macro.
The macro first prepares the input data by creating a surrogate LEAD variable which has negative values for the available history. Time series whose length is shorter than the required length based on the parameters (&MAXHIST, &SHIFTTO, &PERIODVALID) are removed from the analysis.
The macro has an outer loop on SHIFTS and an inner loop on available time histories.
%do shift = &shiftfrom. %to &shiftto. %by &shiftby.;
%do history = &minhist. %to &maxhist.;
In each run, the data are prepared according to the parameters and only the defined subset of observations is provided to the analysis.
data _Hist_check_tmp_input_;
set _Hist_check_tmp_(where = (_count_obs_ >= &minHist.));
_y_valid_ = &y.;
if _lead_*(-1) <= (&periodvalid + &shift.) then &y. = .;
if _lead_*(-1) > (&periodvalid + &shift. + &history) then delete;
run;
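The two IF statements implement the windowing: observations inside the (shifted) validation period keep their row but get a missing target value, and observations older than the requested history are dropped. A hypothetical Python rendering of this per-row rule (with `lead` negative for historic observations, as the macro defines it; the function name is illustrative):

```python
def window_row(lead, y, period_valid, shift, history):
    """Return (keep_row, y_value) for one observation.
    Mask y inside the validation window; drop rows older than the
    requested history."""
    age = -lead  # distance (in periods) from the series end
    if age > period_valid + shift + history:
        return False, None        # older than the allowed history: delete
    if age <= period_valid + shift:
        return True, None         # validation period: keep row, mask y
    return True, y                # training history: keep as-is
```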
PROC HPFENGINE and PROC ESM are used to generate the forecasts in the two macros. They can also be replaced by any other forecasting procedure, such as PROC ARIMA, as long as an output is produced that corresponds to the output table structure of PROC HPFENGINE.
Depending on the available SAS modules, you can run the macro with a SAS/ETS or SAS Econometrics license, where the ESM procedure is used.
proc esm data=_Hist_check_tmp_input_ out=_out_
outfor = _Hist_check_tmp_fc_(drop = lower upper error std)
seasonality=&seasonality lead=&periodvalid;
by &tsid;
id &timeid interval = &interval;
forecast &y/model=&model;
run;
proc sql;
create table _Hist_check_tmp_fc_xt_
as select a.*,b._lead_,b._y_valid_
from _Hist_check_tmp_fc_(drop = _name_) as a
left join _Hist_check_tmp_input_ as b
on a.&tsid. = b.&tsid.
and a.&timeid. = b.&timeid.
order by &tsid., &timeid.;
quit;
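The SQL step performs a left join keyed on the time series ID and the time ID, so each forecast row regains its `_lead_` value and the held-out actual. A hypothetical Python sketch of the same merge (the function and field names are illustrative, not part of the macro):

```python
def join_forecasts(forecasts, inputs):
    """Left join: forecasts is a list of dicts with tsid/timeid/predict;
    inputs maps (tsid, timeid) -> (lead, y_valid). Unmatched rows get None,
    mirroring the missing values a SQL left join would produce."""
    joined = []
    for fc in forecasts:
        key = (fc["tsid"], fc["timeid"])
        lead, y_valid = inputs.get(key, (None, None))
        joined.append({**fc, "lead": lead, "y_valid": y_valid})
    # ORDER BY tsid, timeid
    return sorted(joined, key=lambda r: (r["tsid"], r["timeid"]))
```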
data _ape_;
set _Hist_check_tmp_fc_xt_;
_FC_Period_ = (_lead_*(-1) <= &periodvalid);
_APE_ = abs(predict-_y_valid_)/_y_valid_;
_MS_ = (predict-_y_valid_)**2;
run;
proc means data = _ape_(rename =
(_ape_ = mape _ms_ = _mse_))
noprint nway;
class &tsid _fc_period_;
var mape _mse_;
output out = _mape_(drop=_type_ _freq_
where=(_fc_period_=1)) mean=;
run;
data _mape_;
set _mape_;
rmse = _mse_ ** 0.5;
drop _mse_;
shift=&shift;
History=&history;
drop _fc_period_;
run;
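The MEANS and DATA steps above average the absolute percentage errors and the squared errors per time series, then take the square root of the mean squared error. The same aggregation can be sketched in Python (a hypothetical helper, for illustration only):

```python
import math

def accuracy_by_series(rows):
    """rows: iterable of (tsid, predict, actual) pairs restricted to the
    forecast period. Returns {tsid: (mape, rmse)}, mirroring the per-series
    means computed by PROC MEANS and the RMSE derived afterwards."""
    errors = {}
    for tsid, predict, actual in rows:
        ape = abs(predict - actual) / abs(actual)   # absolute percentage error
        se = (predict - actual) ** 2                # squared error
        errors.setdefault(tsid, []).append((ape, se))
    out = {}
    for tsid, vals in errors.items():
        n = len(vals)
        mape = sum(a for a, _ in vals) / n
        rmse = math.sqrt(sum(s for _, s in vals) / n)
        out[tsid] = (mape, rmse)
    return out
```

Note that, like the data step above, this sketch assumes the actual values are non-zero; zero actuals would need special handling for the MAPE.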
medium.com "Determining the best length of the history of your timeseries data for timeseries forecasting"
There are more SAS Communities articles from the author with a focus on time series analysis:
Webinar on YouTube: https://www.youtube.com/watch?v=-WsYvSasl9w&list=PLdMxv2SumIKsqedLBq0t_a2_6d7jZ6Akq&index=3
This example has been taken from my SAS Press book Data Quality for Analytics Using SAS; see especially chapter 20 (p. 260), chapter 21 (p. 265), and Appendix D.
Download the SAS programs and SAS datasets: https://github.com/gerhard1050/Data-Quality-for-Data-Science-Using-SAS
Further books by the author at SAS Press:
Hi Gerhard,
A good question all forecasters should validate... "How long should our optimal time series be which we feed into our time series forecasting models?" An excellent article, video, and macros that are highly relevant in the sporadic world we are living in and the effect it has on the historical data being used for forecasting.
Cheers,
Michelle